Distribution Free Prediction Bands 

Jing Lei and Larry Wasserman 

Carnegie Mellon University 

March 1, 2013 






Abstract 

We study distribution free, nonparametric prediction bands with a special focus 
on their finite sample behavior. First we investigate and develop different notions of 
finite sample coverage guarantees. Then we give a new prediction band estimator by
combining the idea of "conformal prediction" (Vovk et al., 2009) with nonparametric
conditional density estimation. The proposed estimator, called COPS (Conformal
Optimized Prediction Set), always has finite sample guarantee in a stronger sense than 
the original conformal prediction estimator. Under regularity conditions the estimator 
converges to an oracle band at a minimax optimal rate. A fast approximation algorithm 
and a data driven method for selecting the bandwidth are developed. The method 
is illustrated first in simulated data. Then, an application shows that the proposed 
method gives desirable prediction intervals in an automatic way, as compared to the 
classical linear regression modeling. 

1 Introduction 

Given observations $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}^1$ for $i = 1, \ldots, n$, we want to predict $Y_{n+1}$ given a
future predictor $X_{n+1}$. Unlike typical nonparametric regression methods, our goal is not to
produce a point prediction. Instead, we construct a prediction interval $C_n$ that contains
$Y_{n+1}$ with probability at least $1 - \alpha$. More precisely, assume that $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$
are iid observations from some distribution $P$. We construct, from the first $n$ sample points,
a set-valued function

$$C_n(x) = C_n(X_1, Y_1, \ldots, X_n, Y_n, x) \subseteq \mathbb{R}^1 \tag{1}$$

such that the next response variable $Y_{n+1}$ falls inside $C_n(X_{n+1})$ with a certain level of
confidence. The collection of prediction sets $C_n = \{C_n(x) : x \in \mathbb{R}^d\}$ forms a prediction band.

The prediction set $C_n(x)$ depends on the observed value $X_{n+1} = x$, and should be
interpreted as the estimated set that $Y$ is likely to fall in, given $X_{n+1} = x$. This extends
nonparametric regression by providing a prediction set for each $x$. Such a prediction set
provides useful information about the uncertainty. The problem of prediction intervals is
well studied in the context of linear regression, where prediction intervals are constructed
under linear and Gaussian assumptions (see DeGroot & Schervish (2012), Theorem 11.3.6).
The Gaussian assumption can be relaxed using, for example, quantile regression (Koenker
& Hallock, 2001). These linear model based methods usually have reasonable finite sample
performance. However, the coverage is valid only when the linear (or other parametric)
regression model is correctly specified. On the other hand, nonparametric methods have the
potential to work for any smooth distribution (Ruppert et al., 2003), but only asymptotic
results are available and the finite sample behavior remains unclear.



Recently, Vovk et al. (2009) introduced a generic approach, called conformal prediction, to
construct valid, distribution free, sequential prediction sets. When adapted to our setting,
this yields prediction bands with a finite sample coverage guarantee in the sense that

$$\mathbb{P}[Y_{n+1} \in C_n(X_{n+1})] \ge 1 - \alpha \quad \text{for all } P, \tag{2}$$

where $\mathbb{P} = P^{n+1}$ is the joint measure of $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$. However, the conditional
coverage and statistical efficiency of such bands have not been investigated.



In this paper we extend the results in Vovk et al. (2009) and study conditional coverage as
well as efficiency. We show that although finite sample coverage defined in (2) is a desirable
property, this is not enough to guarantee good prediction bands. We argue that the finite
sample coverage given by (2) should be interpreted as marginal coverage, which is different
from (in fact, weaker than) the conditional coverage as usually sought in prediction problems.
Requiring only marginal validity may lead to unsatisfactory estimation even in very simple
cases. As a result, a good estimator must satisfy something more than marginal coverage. A
natural criterion would be conditional coverage. However, we prove that conditional coverage
is impossible to achieve with a finite sample. As an alternative solution, we develop a new
notion, called local validity, that interpolates between marginal and conditional validity, and
is achievable with a finite sample. This notion leads to our proposed estimator: COPS
(Conformal Optimized Prediction Set). We also show that when the sample size goes to
infinity, under regularity conditions, the locally valid prediction band given by COPS can
give arbitrarily accurate conditional coverage, leading to an asymptotic conditional coverage
guarantee.

Another contribution of this paper is the study of efficiency in the context of nonparametric
prediction bands. Roughly speaking, efficiency requires a prediction band to be small
while maintaining the desired probability coverage in the sense of (2). We study the
efficiency of our estimator by measuring its deviation from an oracle band, the band one should
use if the joint distribution $P$ were known. We also give a minimax lower bound on the
estimation error, which shows that the efficiency of our method is indeed minimax rate optimal
over a certain class of smooth distributions.

To summarize, the method given in this paper is the first one with both finite sample 
(marginal and local) coverage, asymptotic conditional coverage, and an explicit rate for 
asymptotic efficiency. The finite sample marginal and local validity is distribution free: no 
assumptions on P are required; P need not even have a density. Asymptotic conditional 
validity and efficiency are closely related and rely on some standard regularity conditions on 
the density. Furthermore, all tuning parameters are completely data-driven. 

The problem of constructing prediction bands resembles that of nonparametric confidence
band estimation for the regression function $m(x) = \mathbb{E}(Y \mid X = x)$. However, these are two
different inference problems. First note that non-trivial, distribution-free confidence bands
for the regression function $m(x) = \mathbb{E}(Y \mid X = x)$ do not exist (Low, 1997; Genovese &
Wasserman, 2008). On the other hand, in this paper we show that consistent prediction
band estimation is possible under mild regularity conditions. Hence there is a distinct
difference between confidence bands for the regression function and prediction bands.



Prior Work On Nonparametric Prediction Bands. The usual nonparametric prediction
interval takes the form

$$\hat m(x) \pm z_{\alpha/2} \sqrt{\hat\sigma^2 + s^2}, \tag{3}$$

where $\hat m$ is some nonparametric regression estimator, $\hat\sigma^2$ is an estimate of $\mathrm{Var}(Y \mid X)$, $s$ is
an estimate of the standard error of $\hat m$, and $z_{\alpha/2}$ is either a Normal quantile or a quantile
determined by bootstrapping. See, for example, Section 6.2 of Ruppert et al. (2003), Section
2.3.3 of Loader (1999) and Chapter 5 of Fan & Gijbels (1996). The assumption of constant
variance can be relaxed; see, for example, Akritas & Van Keilegom (2001). Other related
work includes Hall & Rieck (2001) on bootstrapping, Davidian & Carroll (1987) on variance
estimation and Carroll & Ruppert (1991) on transformation approaches. However, none of
these methods yields prediction bands with distribution free, finite sample validity.
Furthermore, these methods always produce a prediction set in the form of an interval which,
as we shall see, may not be optimal. In fact, we are not aware of any paper that provides
distribution free finite sample prediction bands with asymptotic optimality properties as we
provide in this paper. The only paper we know of that provides finite sample marginal
validity is the very interesting paper by Vovk et al. (2009). However, that paper focuses on
linear predictors and does not address efficiency or conditional validity.

Outline. In Section 2 we introduce various notions of validity and efficiency. In Section
3 we introduce our methods for prediction bands: the COPS estimator. We study the large
sample and minimax results of the method in Section 4. We discuss bandwidth selection in
Section 5. Section 6 contains some examples. Finally, concluding remarks are in Section 7.

2 Marginal, Conditional, and Local Validity 

2.1 Marginal Validity and Prediction Sets 

Prediction bands are an extension of nonparametric prediction sets (also called tolerance
regions). Suppose we observe $n$ iid copies $Z_1, \ldots, Z_n$ of a random vector $Z$ with distribution $P$
and we want a set $T_n \subseteq \mathbb{R}^d$ such that $\mathbb{P}[Z_{n+1} \in T_n] \ge 1 - \alpha$ for all $P$. Let $Z_i = (X_i, Y_i)$. Since
the probability statement in (2) is over the joint distribution of $(X_1, Y_1), \ldots, (X_{n+1}, Y_{n+1})$,
it is equivalent to

$$\mathbb{P}[(X_{n+1}, Y_{n+1}) \in C_n] \ge 1 - \alpha, \quad \text{for all } P. \tag{4}$$

That is, equation (4) is exactly the definition of a prediction set for the joint distribution
of $(X, Y)$. As a result, any prediction set for the joint distribution provides a solution, with
finite sample coverage, to the prediction band problem. In this subsection we pursue this
point further. In the following subsections we consider improvements.



The study of prediction sets dates back to Wilks (1941), Wald (1943), and Tukey (1947).
More recently, the research on prediction sets has focused on finding statistically efficient
estimators in multivariate cases (Chatterjee & Patra, 1980; Di Bucchianico et al., 2001; Li
& Liu, 2008). Lei et al. (2011) study distribution free, finite sample valid and efficient
estimators of prediction sets. A thorough introduction to prediction sets can be found in
Krishnamoorthy & Mathew (2009).

Figure 1: Joint prediction set and pointwise conditional coverage for bivariate independent
Gaussian. Left panel: the gray area is the optimal (with smallest Lebesgue measure)
prediction set with coverage 0.9, the two red lines are the upper and lower 5% quantiles of the
marginal distribution of $Y$. Right panel: the blue curve plots $\mathbb{P}(Y \in C(x) \mid X = x)$ against
$x$; the red line is the desired coverage level 0.9.



There are many different methods to construct prediction sets. A common measure of
efficiency is the Lebesgue measure, and the optimal prediction set is the one with smallest
Lebesgue measure among all sets with the desired coverage level. It is well-known that the
optimal prediction set at level $1 - \alpha$ (optimal refers to the one having smallest Lebesgue
measure) is an upper level set of the joint density:

$$C^{(\alpha)} = \{(x, y) : p(x, y) \ge t^{(\alpha)}\}, \tag{5}$$

where $t^{(\alpha)}$ is chosen such that $P(C^{(\alpha)}) = 1 - \alpha$. As illustrated in the following example, an
optimal joint prediction set can lead to an unsatisfactory prediction band.

Figure 1 shows the case of a bivariate independent Gaussian. According to (5), when
$X, Y$ are independent standard normals, the level set $C^{(\alpha)}$ for any $\alpha$ is a disc centered at
the origin, shown as the gray area in the left panel of Figure 1. But intuitively, since
observing $X$ provides no information about $Y$, the best prediction band at level $1 - \alpha$ should be
$C(x) = [-z_{\alpha/2}, z_{\alpha/2}]$ for all $x$, where $z_\tau$ is the $\tau$-th upper quantile of the standard normal. This
band is the set between the two red dashed lines in the left panel of Figure 1 for $\alpha = 0.1$.

In prediction, another important notion of coverage is the conditional coverage
$\mathbb{P}(Y \in C(x) \mid X = x)$. The pointwise conditional coverage $\mathbb{P}(Y \in C(x) \mid X = x)$ is plotted in the
right panel of Figure 1 for the joint prediction set (blue curve). We see that the "optimal"
joint prediction set gives slices that are too large when $x$ is in the high density area and
too small for low density $x$. Let us now consider conditional validity in more detail.
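The blue curve in the right panel of Figure 1 has a closed form, which the following Python sketch evaluates. It is an illustration, not the authors' code: it assumes the optimal joint set for two independent standard normals is the disc $x^2 + y^2 \le r^2$ with $r^2 = -2 \log \alpha$ (the $1 - \alpha$ quantile of a chi-square distribution with 2 degrees of freedom), and the function names are hypothetical.

```python
import math


def joint_level_set_radius(alpha):
    """Radius r of the disc {x^2 + y^2 <= r^2} that has probability 1 - alpha
    under two independent standard normals (x^2 + y^2 is chi-square, 2 df)."""
    return math.sqrt(-2.0 * math.log(alpha))


def conditional_coverage(x, alpha):
    """P(Y in C(x) | X = x), where C(x) is the x-slice of the joint level set."""
    r = joint_level_set_radius(alpha)
    if abs(x) >= r:
        return 0.0  # the slice is empty outside the disc
    half_width = math.sqrt(r * r - x * x)
    # For a standard normal Y, P(|Y| <= c) = erf(c / sqrt(2)).
    return math.erf(half_width / math.sqrt(2.0))


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.0, 2.5):
        print(f"x = {x:.1f}: conditional coverage = {conditional_coverage(x, 0.1):.3f}")
```

With $\alpha = 0.1$ the slice at $x = 0$ covers more than 0.9, slices near the edge of the disc cover far less, and the slice is empty once $|x|$ exceeds $r$ (about 2.15).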



2.2 Conditional Validity 

Requiring only (2) for prediction bands is not enough. We will refer to (2) as marginal
validity or joint validity. This is the type of validity used in Shafer & Vovk (2008). As
illustrated in the example above, it may be tempting to insist on a more stringent probability
guarantee such as

$$\mathbb{P}(Y_{n+1} \in C_n(x) \mid X_{n+1} = x) \ge 1 - \alpha \quad \text{for all } P \text{ and almost all } x, \tag{6}$$

which we call conditional validity. If the joint distribution of $(X, Y)$ is known, one can define
an oracle band as the counterpart of (5) for conditionally valid bands:

$$C_P(x) = \{y : p(y \mid x) \ge t_x^{(\alpha)}\}, \tag{7}$$

where $t_x^{(\alpha)}$ satisfies

$$\int \mathbb{1}\{p(y \mid x) \ge t_x^{(\alpha)}\} \, p(y \mid x) \, dy = 1 - \alpha.$$

We call $C_P = \{C_P(x) : x \in \mathbb{R}^d\}$ the conditional oracle band. It is easy to prove that $C_P$
minimizes $\mu[C(x)]$ for all $x$ among all bands satisfying $\inf_x \mathbb{P}(Y \in C(x) \mid X = x) \ge 1 - \alpha$,
where $\mu$ denotes Lebesgue measure.
Note that $C_P$ depends on $P$ but does not depend on the observed data. For an estimator
$\hat C$, asymptotic efficiency requires $\hat C(x)$ to be close to $C_P(x)$ uniformly over all $x$:

$$\sup_x \mu[\hat C(x) \triangle C_P(x)] \xrightarrow{P} 0. \tag{8}$$

However, we will show that there do not exist any prediction bands $\hat C$ that satisfy both (6)
and (8). In fact, the following claim, proved in Subsection 8.2, is even stronger.

Let $P_X$ denote the marginal distribution of $X$ under $P$. A point $x$ is a non-atom for $P$
if $x$ is in the support of $P_X$ and if $P_X[B(x, \delta)] \to 0$ as $\delta \to 0$, where $B(x, \delta)$ is the Euclidean
ball centered at $x$ with radius $\delta$. Let $N(P)$ denote the set of non-atoms. We show that if
$C_n$ is conditionally valid then the length of $C_n(x)$ is infinite for all $x \in N(P)$.

Lemma 1 (Impossibility of non-trivial finite sample conditional validity). Suppose that an
estimator $C_n$ has $1 - \alpha$ conditional validity. For any $P$ and any $x_0 \in N(P)$,

$$\mathbb{P}\left[ \lim_{\delta \to 0} \operatorname*{ess\,sup}_{\|x_0 - x\| \le \delta} \mu[C_n(x)] = \infty \right] = 1.$$



Thus, non-trivial finite sample conditional validity is impossible for continuous
distributions. We shall instead construct prediction bands with an asymptotic version of (6) together
with finite sample marginal validity. We say that $\hat C_n$ is asymptotically conditionally valid if

$$\sup_x \left[ \mathbb{P}(Y_{n+1} \notin \hat C_n(x) \mid X_{n+1} = x) - \alpha \right]_+ \xrightarrow{P} 0 \tag{9}$$

as $n \to \infty$. Here, the supremum is taken over the support of $P_X$. We note that if the
conditional density $p(y \mid x)$ is uniformly bounded for all $(x, y)$, then asymptotic conditional
validity is a consequence of asymptotic efficiency defined as in (8).

In Section 3 we construct a prediction band that satisfies:

1. finite sample marginal validity,

2. asymptotic conditional validity, and

3. asymptotic efficiency.

Our method is based on the notion of local validity, which naturally interpolates between
marginal and conditional validity.

Definition 2 (Local validity). Let $\mathcal{A} = \{A_j : j \ge 1\}$ be a partition of $\mathrm{supp}(P_X)$ such that
each $A_j$ has diameter at most $\delta$. A prediction band $C_n$ is locally valid with respect to $\mathcal{A}$ if

$$\mathbb{P}(Y_{n+1} \in C_n(X_{n+1}) \mid X_{n+1} \in A_j) \ge 1 - \alpha, \quad \text{for all } j \text{ and all } P. \tag{10}$$

Remark. In light of Lemma 1, it is possible to construct finite sample locally
valid prediction sets because $X \in A_j$ is an event with positive probability and hence repeated
observations are available.

Remark. Consider the limiting case $\delta \to \infty$, which can be thought of as having $A_1 =
\mathrm{supp}(P_X)$; then local validity becomes marginal validity. On the other hand, in the extremal
case $\delta \to 0$, $A_j$ shrinks to a single point $x \in \mathbb{R}^d$, and local validity approximates conditional
validity. We also note that local validity is stronger than marginal validity but weaker
than conditional validity. We state the following proposition, whose proof is elementary and
omitted.

Proposition 3. If $C$ is conditionally valid, then it is also locally valid for any partition $\mathcal{A}$.
If $C$ is locally valid for some partition $\mathcal{A}$, then it is also marginally valid.
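The two implications in Proposition 3 can be sketched with the tower property and the law of total probability. The display below is an illustrative derivation, not taken from the paper; the symbol $q_j = \mathbb{P}(X_{n+1} \in A_j)$ is introduced here only for this purpose, and the sum runs over cells with $q_j > 0$.

```latex
% Conditional validity implies local validity (average the bound in (6) over x in A_j):
\mathbb{P}\big(Y_{n+1} \in C(X_{n+1}) \mid X_{n+1} \in A_j\big)
  = \mathbb{E}\Big[\, \mathbb{P}\big(Y_{n+1} \in C(X_{n+1}) \mid X_{n+1}\big) \,\Big|\, X_{n+1} \in A_j \Big]
  \ge 1 - \alpha .

% Local validity implies marginal validity (sum over the cells of the partition):
\mathbb{P}\big(Y_{n+1} \in C(X_{n+1})\big)
  = \sum_{j} q_j \, \mathbb{P}\big(Y_{n+1} \in C(X_{n+1}) \mid X_{n+1} \in A_j\big)
  \ge (1 - \alpha) \sum_{j} q_j
  = 1 - \alpha .
```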

The relationship between local validity and asymptotic conditional validity is more
complicated and is one of the technical contributions of this paper. In Section 3 we construct
a specific class of locally valid bands. In Theorem 9 of Section 4 we show that under mild
regularity conditions, these bands are also asymptotically conditionally valid. To summarize,
if $C$ is locally valid then it is also marginally valid. And under regularity conditions, it can
also be asymptotically conditionally valid. See Figure 2.

How can we construct finite sample locally valid prediction bands? A straightforward
approach is to apply the method developed in Lei et al. (2011) to $P_j = \mathcal{L}(X, Y \mid X \in A_j)$,
the joint distribution of $(X, Y)$ conditional on the event $X \in A_j$. Note that we are mostly
interested in the case $\max_j \mathrm{diam}(A_j) \to 0$, so the marginal density of $X$ within $P_j$
becomes increasingly close to uniform. Therefore, the approach can be simplified to finding
$C_j \subseteq \mathbb{R}^1$ such that $\mathbb{P}(Y \in C_j \mid X \in A_j) \ge 1 - \alpha$. This approach is detailed in Section 3 and
analyzed in Section 4.

Figure 2: Relationship between different types of validity. [Diagram; node labels recovered
from the figure: Conditional Validity, Local Validity.]



3 Methodology 

3.1 A Marginally Valid Prediction Band 

We start by recalling the construction of joint prediction sets using kernel density estimation
together with the idea of conformal prediction, as described in Lei et al. (2011), building on
the conformal prediction framework developed in Shafer & Vovk (2008), Vovk et al. (2005) and
Vovk et al. (2009). This approach is shown to have finite sample validity as well as asymptotic
efficiency under regularity conditions. Suppose we observe

$$Z_1, \ldots, Z_n \sim P$$

and we want a prediction set for $Z_{n+1}$. The idea is to test $H_0 : Z_{n+1} = z$ for each $z$ and
then invert the test. Specifically, for any $z$ let $\hat p_n^z(\cdot)$ be a density estimator based on the
augmented data $\mathrm{aug}(Z; z) = (Z_1, \ldots, Z_n, z)$. Define

$$C_n = C_n(Z_1, \ldots, Z_n) = \{z : \pi_n(z) \ge \alpha\},$$

where

$$\pi_n(z) = \frac{1}{n+1} \sum_{i=1}^{n+1} \mathbb{1}\big(\sigma_i(z) \le \sigma_{n+1}(z)\big)$$

is the p-value for the test, $\sigma_i(z) = \hat p_n^z(Z_i)$ for $i = 1, \ldots, n$ and $\sigma_{n+1}(z) = \hat p_n^z(z)$. The
statistic $\sigma_i$ is an example of a conformity measure. More generally, a conformity measure
$\sigma_i(z) = \sigma(\mathrm{aug}(Z, z), Z_i)$ indicates how well a data point $Z_i$ agrees with the augmented data
set $\mathrm{aug}(Z, z)$. In principle $\sigma(\cdot, \cdot)$ can be any function, but usually it makes sense to use the
fitted residual or likelihood at $Z_i$ with respect to a model estimated from $\mathrm{aug}(Z, z)$.

The intuition for $C_n$ is the following. Fix an arbitrary value $z$. To test $H_0 : Z_{n+1} = z$
we use the heights of the density estimators $\sigma_i(z) = \hat p_n^z(Z_i)$ as a test statistic. (Note that
$\sigma_1, \ldots, \sigma_{n+1}$ are functions of $\mathrm{aug}(Z, z)$.) Under $H_0$, the joint distribution of
$(Z_1, \ldots, Z_n, Z_{n+1})$ does not change under permutations, so the vector $(\sigma_1, \ldots, \sigma_{n+1})$ is
exchangeable and the ranks of the $\sigma_i$ are uniform. Therefore, under $H_0$, $\pi_n(z)$ is uniformly
distributed over $[0, 1]$ and is a valid p-value for the test.¹ The set $C_n$ is obtained by inverting
the hypothesis test, that is, $C_n$ consists of all values $z$ that are not rejected by the test. It
then follows that $\mathbb{P}(Z_{n+1} \in C_n) \ge 1 - \alpha$ for all $P$.

¹ More precisely, it is sub-uniform due to the discreteness.
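The test inversion described above can be sketched in a few lines of Python for one-dimensional data. This is a minimal illustration under stated assumptions rather than the authors' implementation: the kernel is taken to be Gaussian, candidate values $z$ are restricted to a finite grid, and all function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(points, h, u):
    """Kernel density estimate at u built from `points` with bandwidth h."""
    return sum(gaussian_kernel((u - p) / h) for p in points) / (len(points) * h)


def conformal_p_value(data, z, h):
    """pi_n(z): share of augmented points whose conformity score is <= that of z."""
    augmented = list(data) + [z]
    # Conformity scores: density heights under the estimator fitted to aug(Z; z).
    sigma = [kde(augmented, h, p) for p in augmented]
    return sum(s <= sigma[-1] for s in sigma) / len(augmented)


def conformal_set_on_grid(data, alpha, h, grid):
    """Grid values z that the test does not reject, i.e. with pi_n(z) >= alpha."""
    return [z for z in grid if conformal_p_value(data, z, h) >= alpha]


if __name__ == "__main__":
    sample = [i / 10 for i in range(-20, 21)]  # toy data: 41 evenly spaced points
    print(conformal_set_on_grid(sample, alpha=0.1, h=0.5, grid=[-5.0, 0.0, 5.0]))
```

Each p-value refits the density on the augmented sample, so the cost is quadratic in $n$ for every candidate $z$; this is the expense that motivates a faster approximation.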



In Lei et al. (2011), the density $\hat p_n^z$ is obtained from kernel density estimators with
bandwidth $h$.

Lei et al. (2011) show that the conformal set $\hat C^{(\alpha)}$ (the set $C_n$ defined above, at level $\alpha$)
is also efficient, meaning that it is close to $C^{(\alpha)}$ with high probability, where $C^{(\alpha)}$ is the
smallest set with probability content $1 - \alpha$ as defined in (5). Computing $\hat C^{(\alpha)}$ is expensive
since we need to find the p-value $\pi_n(z)$ for every $z$. Lei et al. (2011) proposed the following
approximation $C^+$ to $C_n$, called the sandwich approximation, which avoids the augmentation
step altogether but preserves finite sample validity. Let $Z_{(1)}, Z_{(2)}, \ldots$ denote the data
ordered increasingly by $\hat p(Z_i)$, where $\hat p$ is the kernel density estimator computed from the
original sample. Let $j = \lfloor n\alpha \rfloor$ and define

$$C^+ = \left\{ z : \hat p(z) \ge \hat p(Z_{(j)}) - \frac{K(0)}{n h^d} \right\}. \tag{11}$$

Lei et al. (2011) show that $\hat C^{(\alpha)} \subseteq C^+$ and hence $C^+$ also has finite sample validity. Moreover,
$C^+$ has the same efficiency properties as $C_n$ if $h$ is chosen appropriately. This result, known
as the "Sandwich Lemma", provides a simple characterization of the conformal prediction
set $\hat C^{(\alpha)}$ in terms of the plug-in density level set. In this paper, a specific version of the
Sandwich Lemma for the conditional density is stated in Lemma 8. Thus, using the sandwich
approximation we get a fast method for constructing a valid band, based on slicing the joint
density.
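The sandwich approximation replaces one density fit per candidate $z$ with a single cutoff on the plug-in density. The Python sketch below illustrates this for one-dimensional data ($d = 1$); it assumes a Gaussian kernel, takes $j = \lfloor n\alpha \rfloor$ with $j \ge 1$, and uses hypothetical function names. The conformal p-value is included only so that the containment of the conformal set in $C^+$ can be checked numerically.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(points, h, u):
    """Kernel density estimate at u built from `points` with bandwidth h."""
    return sum(gaussian_kernel((u - p) / h) for p in points) / (len(points) * h)


def sandwich_cutoff(data, alpha, h):
    """Density cutoff of C+: p_hat(Z_(j)) - K(0) / (n h), with j = floor(n alpha)."""
    n = len(data)
    j = math.floor(n * alpha)
    if j < 1:
        raise ValueError("need floor(n * alpha) >= 1")
    heights = sorted(kde(data, h, z) for z in data)
    return heights[j - 1] - gaussian_kernel(0.0) / (n * h)


def in_sandwich_set(z, data, cutoff, h):
    """Membership in C+: one plug-in density evaluation, no augmentation."""
    return kde(data, h, z) >= cutoff


def conformal_p_value(data, z, h):
    """Exact conformal p-value, refitting the density on the augmented sample."""
    augmented = list(data) + [z]
    sigma = [kde(augmented, h, p) for p in augmented]
    return sum(s <= sigma[-1] for s in sigma) / len(augmented)


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(40)]
    cutoff = sandwich_cutoff(sample, alpha=0.1, h=0.4)
    grid = [-4.0 + 0.1 * i for i in range(81)]
    kept = [z for z in grid if in_sandwich_set(z, sample, cutoff, 0.4)]
    print(f"C+ covers {len(kept)} of {len(grid)} grid points")
```

The cutoff is computed once from the original sample, so checking a new $z$ costs a single density evaluation.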

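Now let $Z = (X, Y)$. The $x$-slices of the joint region for $Z$ define a marginally valid band.
Specifically, let $K_x$ and $K_y$ be two kernel functions in $\mathbb{R}^d$ and $\mathbb{R}^1$, respectively. Consider the
kernel density estimator: for any $(u, v) \in \mathbb{R}^d \times \mathbb{R}^1$,

$$\hat p_{n;X,Y}(u, v) = \frac{1}{n h_n^{d+1}} \sum_{i=1}^{n} K_x\!\left(\frac{u - X_i}{h_n}\right) K_y\!\left(\frac{v - Y_i}{h_n}\right). \tag{12}$$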



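For any $(x, y) \in \mathbb{R}^d \times \mathbb{R}^1$, let $(X, Y) = (X_1, Y_1, \ldots, X_n, Y_n)$ be the data set and
$\mathrm{aug}(X, Y; (x, y))$ be the augmented data with $X_{n+1} = x$ and $Y_{n+1} = y$. Define $\hat p^{(x,y)}_{n;X,Y}$ to be
the kernel density estimator from the augmented data:

$$\hat p^{(x,y)}_{n;X,Y}(u, v) = \frac{n}{n+1} \hat p_{n;X,Y}(u, v) + \frac{1}{(n+1) h_n^{d+1}} K_x\!\left(\frac{u - x}{h_n}\right) K_y\!\left(\frac{v - y}{h_n}\right). \tag{13}$$

Define the conformity measure

$$\sigma_i(x, y) := \hat p^{(x,y)}_{n;X,Y}(X_i, Y_i) \tag{14}$$

and p-value

$$\pi_i = \frac{1}{n+1} \sum_{j=1}^{n+1} \mathbb{1}\big(\sigma_j(x, y) \le \sigma_i(x, y)\big), \quad \text{for } 1 \le i \le n+1. \tag{15}$$

Let $\tilde\alpha = \lfloor (n+1)\alpha \rfloor / (n+1)$. Since $(X_i, Y_i)_{i=1}^{n+1}$ are iid, by exchangeability, we have, for all $i$,

$$\mathbb{P}(\pi_i \ge \tilde\alpha) \ge 1 - \alpha. \tag{16}$$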
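The bound in (16) can be checked by a short rank computation. The following is a sketch under the extra assumption that the conformity scores have no ties (ties can only increase the count of scores not exceeding $\sigma_i$, so the bound is unaffected); the symbols $m = \lfloor (n+1)\alpha \rfloor$ and $R_i = (n+1)\pi_i$ are introduced here for illustration, and $m \ge 1$ is assumed.

```latex
% By exchangeability the rank R_i is uniform on {1, ..., n+1}, and
% \pi_i \ge \tilde\alpha = m/(n+1) is the same event as R_i \ge m. Hence
\mathbb{P}(\pi_i \ge \tilde\alpha)
  = \mathbb{P}(R_i \ge m)
  = 1 - \frac{m - 1}{n + 1}
  \ge 1 - \frac{(n+1)\alpha - 1}{n + 1}
  = 1 - \alpha + \frac{1}{n + 1}
  > 1 - \alpha .
```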





Define

$$\hat C^{(\alpha)}(x) = \{y : \pi_{n+1}(x, y) \ge \tilde\alpha\},$$

where $\pi_{n+1} = \pi_{n+1}[\mathrm{aug}(X, Y; (x, y))]$. From (16) we have:

Lemma 4. $\hat C^{(\alpha)}(x)$ is finite sample marginally valid:

$$\mathbb{P}\big[Y_{n+1} \in \hat C^{(\alpha)}(X_{n+1})\big] \ge 1 - \alpha \quad \text{for all } P.$$

Now we use the sandwich approximation to the joint conformal region for $(X, Y)$. The
resulting band $C^+(x)$ is obtained by fixing $X = x$ and taking slices of the joint region, and
is therefore a marginally valid band. See Algorithm 1.

Algorithm 1. Sandwich Slicer Algorithm

1. Let $\hat p(x, y)$ be the joint density estimator.

2. Let $Z_i = (X_i, Y_i)$ and let $Z_{(1)}, Z_{(2)}, \ldots$ denote the sample ordered increasingly
by $\hat p(X_i, Y_i)$.

3. Let $j = \lfloor n\alpha \rfloor$ and define

$$C^+(x) = \left\{ y : \hat p(x, y) \ge \hat p(X_{(j)}, Y_{(j)}) - \frac{K_x(0) K_y(0)}{n h_n^{d+1}} \right\}. \tag{17}$$

To summarize: the band given in Algorithm 1 is marginally valid. But it is not efficient 
nor does it satisfy asymptotic conditional validity. This leads to the subject of the next 
section. 
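A compact Python sketch of the Sandwich Slicer for a scalar predictor ($d = 1$) follows. It is illustrative only: it assumes Gaussian kernels for both $K_x$ and $K_y$, a single bandwidth $h$, a cutoff margin of $K_x(0) K_y(0) / (n h^{2})$ (the $d = 1$ analogue of the margin in (11), which is an assumption here), and a finite grid of candidate $y$ values. The function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used for both K_x and K_y (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def joint_kde(xs, ys, h, u, v):
    """Product-kernel estimate of the joint density at (u, v) for d = 1."""
    total = sum(
        gaussian_kernel((u - x) / h) * gaussian_kernel((v - y) / h)
        for x, y in zip(xs, ys)
    )
    return total / (len(xs) * h ** 2)


def sandwich_slicer(xs, ys, alpha, h, x0, y_grid):
    """Grid approximation of the slice C+(x0) of the joint sandwich region."""
    n = len(xs)
    j = math.floor(n * alpha)
    if j < 1:
        raise ValueError("need floor(n * alpha) >= 1")
    # Density heights at the sample points, sorted increasingly.
    heights = sorted(joint_kde(xs, ys, h, x, y) for x, y in zip(xs, ys))
    # Assumed margin: K_x(0) * K_y(0) / (n * h^(d+1)) with d = 1.
    cutoff = heights[j - 1] - gaussian_kernel(0.0) ** 2 / (n * h ** 2)
    return [y for y in y_grid if joint_kde(xs, ys, h, x0, y) >= cutoff]


if __name__ == "__main__":
    xs = [i / 20 for i in range(-20, 21)]  # toy predictors on [-1, 1]
    ys = [2 * x for x in xs]               # toy responses on the line y = 2x
    y_grid = [k / 10 for k in range(-50, 101)]
    band = sandwich_slicer(xs, ys, alpha=0.1, h=0.3, x0=0.0, y_grid=y_grid)
    print(f"slice at x = 0 spans [{min(band)}, {max(band)}]")
```

Because the cutoff is shared by all $x$, slices taken where the estimated joint density is low can be empty, which matches the remark that the band is marginally but not conditionally valid.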

3.2 Locally Valid Bands 

Now we extend the idea of conformal prediction to construct prediction bands with local
validity. These bands will also be asymptotically efficient and have asymptotic conditional
validity. For simplicity of presentation, we assume that $\mathrm{supp}(P_X) = [0, 1]^d$, where $\mathrm{supp}(P_X)$
denotes the support of $P_X$, and we consider partitions $\mathcal{A} = \{A_k, k \ge 1\}$ in the form of cubes
with sides of length $w_n$. Let $n_k = \sum_{i=1}^{n} \mathbb{1}(X_i \in A_k)$ be the histogram count.

Given a kernel function $K(\cdot) : \mathbb{R}^1 \mapsto \mathbb{R}^1$ and another bandwidth $h_n$, consider the
estimated local marginal density of $Y$:

$$\hat p(v \mid A_k) = \frac{1}{n_k h_n} \sum_{i=1}^{n} \mathbb{1}(X_i \in A_k) \, K\!\left(\frac{v - Y_i}{h_n}\right).$$

The corresponding augmented estimate is, for any $(x, y) \in A_k \times \mathbb{R}^1$,

$$\hat p^{(x,y)}(v \mid A_k) = \frac{n_k}{n_k + 1} \hat p(v \mid A_k) + \frac{1}{(n_k + 1) h_n} K\!\left(\frac{v - y}{h_n}\right). \tag{18}$$

Algorithm 2: Local Sandwich Slicer Algorithm

1. Divide $\mathcal{X}$ into bins $A_1, \ldots, A_m$.

2. Apply Algorithm 1 separately on all $Y_i$'s within each $A_k$.

3. Output $C^+(x)$: for all $x \in A_k$, the resulting set of $A_k$.



For any $(x, y) \in A_k \times \mathbb{R}^1$, consider the following local conformity rank

$$\pi_{n,k}(x, y) = \frac{1}{n_k + 1} \sum_{i=1}^{n+1} \mathbb{1}(X_i \in A_k) \, \mathbb{1}\!\left[ \hat p^{(x,y)}(Y_i \mid A_k) \le \hat p^{(x,y)}(Y_{n+1} \mid A_k) \right], \tag{19}$$

which can be interpreted as the local conditional density rank. It is easy to check that
$\pi_{n,k}(x, y)$ has a sub-uniform distribution if $(X_{n+1}, Y_{n+1}) = (x, y)$ is another independent
sample from $P$. Therefore, the band

$$\hat C(x) = \{y : \pi_{n,k}(x, y) \ge \alpha\} \tag{20}$$

for $x \in A_k$ has finite sample local validity.

Proposition 5. For $x \in A_k$, let $\hat C(x) = \{y : \pi_{n,k}(x, y) \ge \alpha\}$, where $\pi_{n,k}(x, y)$ is defined as
in (19). Then $\hat C(x)$ is finite sample locally valid and hence finite sample marginally valid.

Proof. Fix $k$, and let $\{i_1, \ldots, i_{n_k}\} = \{i : 1 \le i \le n, X_i \in A_k\}$. Let $(X_{n+1}, Y_{n+1}) \sim P$ be another
independent sample. Define $i_{n_k + 1} = n + 1$ and $\sigma_{i_\ell} = \hat p^{(x,y)}(Y_{i_\ell} \mid A_k)$ for all $1 \le \ell \le n_k + 1$.
Then, conditioning on the event $X_{n+1} \in A_k$ and $(i_1, \ldots, i_{n_k})$, the sequence
$(\sigma_{i_1}, \ldots, \sigma_{i_{n_k}}, \sigma_{i_{n_k + 1}})$ is exchangeable. □

We call $\hat C$ the Conformal Optimized Prediction Set (COPS) estimator, where the word
"optimized" stands for the effort of minimizing the average interval length $\mathbb{E}_X \, \mu[\hat C(X)]$.

We give a fast approximation algorithm that is analogous to Algorithm 1. The resulting
approximation also satisfies finite sample local validity as well as asymptotic efficiency, as
shown in Section 4. See Algorithm 2.
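The Local Sandwich Slicer can be sketched in Python for a scalar predictor as follows. This is a hedged illustration, not the authors' code: bins are taken to be intervals $[kw, (k+1)w)$, the kernel is Gaussian, candidate $y$ values lie on a finite grid, and bins with too few points to define the cutoff fall back to the whole grid (a choice made here, not specified in the text). All names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(points, h, v):
    """Kernel density estimate at v built from `points` with bandwidth h."""
    return sum(gaussian_kernel((v - p) / h) for p in points) / (len(points) * h)


def local_sandwich_slicer(xs, ys, alpha, w, h, y_grid):
    """Map each bin index k (bin A_k = [k*w, (k+1)*w)) to its prediction set on y_grid."""
    bins = {}
    for x, y in zip(xs, ys):
        bins.setdefault(math.floor(x / w), []).append(y)
    band = {}
    for k, local_ys in bins.items():
        n_k = len(local_ys)
        j = math.floor(n_k * alpha)
        if j < 1:
            # Assumed fallback: too few local points, return the trivial set.
            band[k] = list(y_grid)
            continue
        # Sandwich cutoff computed from the responses in this bin only.
        heights = sorted(kde(local_ys, h, y) for y in local_ys)
        cutoff = heights[j - 1] - gaussian_kernel(0.0) / (n_k * h)
        band[k] = [v for v in y_grid if kde(local_ys, h, v) >= cutoff]
    return band


def predict_set(band, x, w):
    """Prediction set for a new x; returns None if its bin contains no training data."""
    return band.get(math.floor(x / w))


if __name__ == "__main__":
    xs = [0.01 * i for i in range(40)] + [0.5 + 0.01 * i for i in range(40)]
    ys = [-1 + 2 * i / 39 for i in range(40)] + [4 + 2 * i / 39 for i in range(40)]
    grid = [k / 10 for k in range(-30, 81)]
    band = local_sandwich_slicer(xs, ys, alpha=0.1, w=0.5, h=0.3, y_grid=grid)
    for x_new in (0.2, 0.7):
        s = predict_set(band, x_new, 0.5)
        print(f"x = {x_new}: set spans [{min(s)}, {max(s)}]")
```

Ranking within each bin is what distinguishes this from the joint slicer: the cutoff adapts to the local distribution of $Y$, so the band follows the shift in the response between the two bins of the toy data.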

Remark 6. In the approach described above, the local conformity measure is $\hat p^{(x,y)}(v \mid A_k)$. In
principle one can use any conformity measure, which need not depend on the partition
$A_k$, as long as the symmetry condition is satisfied. For example, one can use either the
estimated joint density $\hat p^{(x,y)}(u, v)$ or the estimated conditional density $\hat p^{(x,y)}(v \mid u)$. We note
that when $\mathrm{diam}(A_k)$ is small, these choices of conformity measure are close to each other
since $p_X(x)$ and $p(\cdot \mid x)$ change very little when $x$ varies inside $A_k$.

Remark 7. Although one can choose any conformity measure, in order to have local validity
the ranking must be based on a local subset of the sample. When $A_k$ is small and the
distribution is smooth enough, the local sample $(Y_{i_\ell} : 1 \le \ell \le n_k)$ approximates independent
observations from $p(\cdot \mid X = x)$ for $x \in A_k$, which can be used to approximate the conditional
oracle $C_P(x)$.




4 Asymptotic Properties 

In this section we investigate the asymptotic efficiency of the locally valid prediction band
given in (20). The efficiency argument is similar for other choices of conformity measures,
such as joint density or conditional density. Again, we focus on cases where $\mathrm{supp}(P_X) =
[0, 1]^d$ and $\mathcal{A}$ is a cubic histogram with width $w_n$. The conformity measure is $\hat p^{(x,y)}(Y_i \mid A_k)$
for $x \in A_k$, where $\hat p^{(x,y)}(v \mid A_k)$ is defined as in equation (18) with kernel bandwidth $h_n$.

4.1 Notation 

In the subsequent arguments, $p_X(\cdot)$ denotes the marginal density of $X$, $p(y \mid x)$ the
conditional density of $Y$ given $X = x$, and $p(y \mid A_k)$ the conditional density of $Y$ given $X \in A_k$. The
kernel estimator of $p(y \mid A_k)$ is denoted by $\hat p(\cdot \mid A_k)$ and $\hat P(\cdot \mid A_k)$ is the empirical distribution
of $(Y \mid X \in A_k)$.

The upper and lower level sets of the conditional density $p(y \mid x)$ are denoted by $L_x(t) = \{y :
p(y \mid x) \ge t\}$ and $L_x^\ell(t) = \{y : p(y \mid x) \le t\}$, respectively; $\hat L_k(t)$, $\hat L_k^\ell(t)$ are the counterparts of
$L_x(t)$ and $L_x^\ell(t)$, defined for $\hat p(\cdot \mid A_k)$. As in the definition of the conditional oracle, $t_x^{(\alpha)}$ is the
solution to the equation $P_x(L_x(t)) = 1 - \alpha$, where $P_x$ denotes the conditional distribution of $Y$
given $X = x$. Its existence and uniqueness are guaranteed if the contour
$\{y : p(y \mid x) = t\}$ has zero measure for all $t > 0$. Finally we let $G_x(t) = P_x(L_x^\ell(t))$.

4.2 The Sandwich Lemma 

Heuristically, $p(y \mid A_k) \approx p(y \mid x)$ for $x \in A_k$ when $\mathrm{diam}(A_k)$ is small and $p(y \mid x)$ varies
smoothly in $x$. As a result, the estimated densities $\hat p^{(x,y)}(Y_i \mid A_k)$ can be viewed as roughly a
sample from $p(Y \mid x)$, and hence $\hat C(x)$ approximates the conditional oracle $C_P(x)$. First we
show that $\hat C(x)$ can be approximated by two plug-in conditional density level sets (Lemma
8). For a fixed $A_k \in \mathcal{A}$, conditioning on $(i_1, \ldots, i_{n_k})$, let $(X_{(k,\alpha)}, Y_{(k,\alpha)})$ be the element of
$\{(X_{i_1}, Y_{i_1}), \ldots, (X_{i_{n_k}}, Y_{i_{n_k}})\}$ such that $\hat p(Y_{(k,\alpha)} \mid A_k)$ ranks $\lfloor n_k \alpha \rfloor$ in ascending order among all
$\hat p(Y_{i_j} \mid A_k)$, $1 \le j \le n_k$.

Lemma 8 (The Sandwich Lemma (Lei et al., 2011)). For any fixed $\alpha \in (0, 1)$, if $\hat C(x)$ is
defined in (20) and $\|K\|_\infty = K(0)$, then $\hat C(x)$ is "sandwiched" by two plug-in conditional
density level sets:

$$\hat L_k\big(\hat p(Y_{(k,\alpha)} \mid A_k)\big) \subseteq \hat C(x) \subseteq \hat L_k\big(\hat p(Y_{(k,\alpha)} \mid A_k) - (n_k h_n)^{-1} \psi_K\big), \tag{21}$$

where $\psi_K = \sup_{x, x'} |K(x) - K(x')|$.

The Sandwich Lemma provides a simple and accurate characterization of $\hat C(x)$ in terms of
plug-in conditional density level sets, which are much easier to estimate. The asymptotic
properties of $\hat C(x)$ can be obtained from those of the sandwiching sets.

4.3 Rates of convergence 

To show the asymptotic efficiency of $\hat C(x)$, it suffices to show efficiency for both
sandwiching sets in Lemma 8. We need regularity conditions to quantify and control the
approximations $p(y \mid x) \approx p(y \mid A_k)$, $\hat p(y \mid A_k) \approx p(y \mid A_k)$, and $\hat L_k(t) \approx L_x(t)$.

The following assumption puts boundedness and smoothness conditions on the marginal
density $p_X$, conditional density $p(y \mid x)$, and its derivatives.

Assumption A1 (regularity of marginal and conditional densities)

(a) The marginal density of $X$ satisfies $0 < p_0 \le p_X(x) \le p_1 < \infty$ for all $x$.

(b) For all $x$, $p(\cdot \mid x)$ is in the Hölder class $\mathcal{P}(\beta, L)$. Correspondingly, the kernel $K$ is a valid
kernel of order $\beta$.

(c) For any $0 \le s \le \lfloor \beta \rfloor$, $p^{(s)}(y \mid x)$ is continuous and uniformly bounded by $L$ for all $x, y$.

(d) The conditional density is Lipschitz in $x$: $\|p(\cdot \mid x) - p(\cdot \mid x')\|_\infty \le L \|x - x'\|$.

The Hölder class of smooth functions and valid kernels are common concepts in nonparametric
density estimation. We give their definitions in Appendix 8.1. Assumptions A1(b)
and A1(c) imply that $p(\cdot \mid A_k)$ is also in a Hölder class and can be estimated well by kernel
estimators. A1(d) enables us to approximate $p(\cdot \mid x)$ by $p(\cdot \mid A_k)$ for all $x \in A_k$.

The next assumption gives sufficient regularity conditions on the level sets $L_x(t)$.

Assumption A2 (regularity of conditional density level set)

(a) There exist positive constants $\epsilon_0$, $\gamma$, $c_1$, $c_2$ such that

$$c_1 (t_2 - t_1)^\gamma \le G_x(t_2) - G_x(t_1) \le c_2 (t_2 - t_1)^\gamma$$

for all $t_x^{(\alpha)} - \epsilon_0 \le t_1 < t_2 \le t_x^{(\alpha)} + \epsilon_0$.

(b) There exist positive constants $t_0$ and $C$ such that $0 < t_0 < \inf_x t_x^{(\alpha)}$ and $\mu(L_x(t_0)) \le C$
for all $x$.


Assumption A2(a) is related to the notion of "$\gamma$-exponent" condition introduced by Polonik
(1995) and widely used in the density level set literature (Tsybakov, 1997; Rigollet & Vert,
2009). It ensures that the conditional density function $p(\cdot \mid x)$ is neither too flat nor too
steep near the contour at level $t_x^{(\alpha)}$, so that the cut-off value $t_x^{(\alpha)}$ and the conditional density
level set $C_P(x)$ can be approximated from a finite sample. As mentioned in Audibert &
Tsybakov (2007), if Assumption A1(b) also holds, the oracle band $C_P(x)$ is non-empty only
if $\gamma(\beta \wedge 1) \le 1$, which holds for the most common case $\gamma = 1$. Part (b) simply puts
some constraints on the optimal levels as well as the size of the level sets.



The following critical rate will be used repeatedly in our analysis:

$$r_n = \left( \frac{\log n}{n} \right)^{\frac{\beta}{\beta(d+2)+1}}. \tag{22}$$
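To get a feel for how slowly the critical rate in (22) decays, it can be evaluated directly. The short Python helper below is an illustrative sketch; the function name and the example values of $n$, $\beta$ and $d$ are chosen here and do not come from the paper.

```python
import math


def critical_rate(n, beta, d):
    """r_n = (log n / n) ** (beta / (beta * (d + 2) + 1)), as in (22)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    if beta <= 0 or d < 1:
        raise ValueError("beta must be positive and d must be at least 1")
    return (math.log(n) / n) ** (beta / (beta * (d + 2) + 1))


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**6):
        rates = ", ".join(f"d={d}: {critical_rate(n, 1.0, d):.3f}" for d in (1, 2, 5))
        print(f"n = {n}: {rates}")
```

For $\beta = 1$ and $d = 1$ the exponent is $1/4$, and the exponent shrinks as $d$ grows, so the rate deteriorates with the dimension of the predictor.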



The rate may appear to be non-standard. This is because we are assuming different
amounts of smoothness on $y$ and $x$. This seems to be necessary to achieve both marginal
and local validity. We do not know of any procedure that uses a smoother construction
and still retains finite sample validity. The next theorem gives the convergence rate on the
asymptotic efficiency of the locally valid prediction band constructed in Subsection 3.2.

Theorem 9. Let $\hat C$ be the prediction band given by the local conformity procedure as described
in (20). Choose $w_n \asymp r_n$, $h_n \asymp r_n^{1/\beta}$. Under Assumptions A1-A2, for any $\lambda > 0$, there exists
a constant $A_\lambda$ such that

$$\mathbb{P}\left( \sup_x \mu\big(\hat C(x) \triangle C_P(x)\big) \ge A_\lambda r_n^{\gamma_1} \right) = O(n^{-\lambda}),$$

where $\gamma_1 = \min(1, \gamma)$.

Thus, in the common case $\gamma = 1$, the rate is $r_n$. The following lemma follows easily from
the previous result. 

Lemma 10. Under Assumptions A1 and A2, the local band is asymptotically conditionally
valid.

Remark 11. It follows from the proof that the output of Algorithm 2 also satisfies the same 
asymptotic efficiency and conditional validity results. 

4.4 Minimax Bound 

The next theorem says that in the most common case $\gamma = 1$, the rate given in Theorem 9 is indeed minimax rate optimal. We define the minimax risk by

$$ \inf_{\hat C \in \mathcal{C}_{n,\alpha}} \ \sup_{P \in \mathcal{P}(\beta, L)} E_P\, \mu\big[\hat C(x) \triangle C_P(x)\big] \tag{23} $$

where $\mathcal{C}_{n,\alpha}$ is the set of all valid prediction sets, and $\mathcal{P}(\beta, L)$ is the class of distributions satisfying A1 and A2 with $\gamma = 1$. We can obtain a lower bound on the minimax risk by taking the infimum over all set estimators $\hat C$, as in the following result.

Theorem 12 (Lower bound on estimation error). Let $\mathcal{P}(\beta, L)$ be the class of distributions on $[0,1]^d \times \mathbb{R}^1$ such that for each $P \in \mathcal{P}(\beta, L)$, $P_X$ is uniform on $[0,1]^d$, and $P$ satisfies Assumptions A1-A2 with $\gamma = 1$. Fix an $\alpha \in (0,1)$. There exists a constant $c = c(\alpha, \beta, L, d) > 0$ such that

$$ \inf_{\hat C} \ \sup_{P \in \mathcal{P}(\beta, L)} E_P\, \mu\big(\hat C(x) \triangle C_P(x)\big) \ge c\, r_n. $$

Hence, our procedure achieves the same rate as the lower bound and so is minimax rate optimal over the class $\mathcal{P}(\beta, L)$. The proof of Theorem 12 is in Section 8.4 and uses a somewhat non-standard construction.

5 Tuning Parameter Selection 



In the band given by (20), there are two bandwidths to choose: $w_n$ and $h_n$. Note that since each bin $A_k$ can use a different $h_n$ to estimate the local marginal density $p(\cdot \mid A_k)$, we can consider $h_{n,k}$, allowing a different kernel bandwidth for each bin.

Since all bandwidths give local validity, one can choose the combination of $(w_n, h_{n,k})$ such that the resulting conformal set has the smallest Lebesgue measure. Such a two-stage procedure of selecting $w_n$ and $h_{n,k}$ from discrete candidate sets $\mathcal{W} = \{w_1, \ldots, w_m\}$ and $\mathcal{H} = \{h_1, \ldots, h_{m'}\}$ is detailed in Algorithm 3. To preserve finite sample marginal validity with data-driven bandwidths, we split the sample into two equal-sized subsamples, apply the tuning algorithm on one subsample, and use the output bandwidths on the other subsample to obtain the prediction band.

Algorithm 3: Bandwidth Tuning for COPS
Input: Data $Z$, level $\alpha$, candidate sets $\mathcal{W}$, $\mathcal{H}$.

1. Split the data set into two equal-sized subsamples, $Z_1$, $Z_2$.

2. For each $w \in \mathcal{W}$:

(a) Construct the partition $\mathcal{A}^w$.

(b) For each $k$ and $h$, construct the local conformal prediction set $\hat C_{h,k}$, each at level $1 - \alpha$, using data $Z_1$.

(c) Let $h^*_{w,k} = \arg\min_{h \in \mathcal{H}} \mu(\hat C_{h,k})$, for all $k$.

(d) Let $Q(w) = \frac{1}{n} \sum_k n_k\, \mu(\hat C_{h^*_{w,k},k})$.

3. Choose $\hat w = \arg\min_w Q(w)$; $\hat h_k = h^*_{\hat w, k}$.

4. Construct the partition $\mathcal{A}^{\hat w}$. For $x \in A_k$, output the prediction band $\hat C(x) = \hat C_{\hat h_k, k}$, where the local conformal prediction set $\hat C_{\hat h_k, k}$ is estimated from data $Z_2$ in the local set $A_k$.
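The tuning stage of Algorithm 3 (steps 2 and 3, run on the first subsample) can be sketched as follows for $d = 1$. This is only an illustration under stated assumptions: a Gaussian kernel is used, and the exact conformal set is replaced by a stand-in level set of the local kernel density estimate whose threshold is lowered by $\psi_K/(n_k h)$, in the spirit of the outer sandwiching set mentioned in the proof of Theorem 9. The names `kde`, `set_measure`, `bin_index` and `tune` are hypothetical, and every bin is assumed to be non-empty.

```python
import numpy as np

def kde(y_eval, y_train, h):
    """Gaussian kernel density estimate of y_train, evaluated at y_eval."""
    u = (np.asarray(y_eval)[:, None] - np.asarray(y_train)[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(y_train) * h * np.sqrt(2 * np.pi))

def set_measure(yk, alpha, h, grid):
    """Grid approximation to the Lebesgue measure of the stand-in local set."""
    dens = np.sort(kde(yk, yk, h))
    rank = max(int(np.floor(len(yk) * alpha)), 1)   # rank floor(n_k * alpha)
    psi_K = 1.0 / np.sqrt(2 * np.pi)                # sup K - inf K, Gaussian kernel
    thr = dens[rank - 1] - psi_K / (len(yk) * h)
    return (grid[1] - grid[0]) * np.count_nonzero(kde(grid, yk, h) >= thr)

def bin_index(x, lo, hi, w):
    """Assign each x to one of k equal-width bins of width about w on [lo, hi]."""
    k = int(round((hi - lo) / w))
    return np.clip(((x - lo) / (hi - lo) * k).astype(int), 0, k - 1), k

def tune(x1, y1, alpha, W, H, lo, hi, grid):
    """Steps 2-3: return (w_hat, [h_hat_k]) minimising Q(w) on subsample Z1."""
    best = None
    for w in W:
        idx, k = bin_index(x1, lo, hi, w)
        q, h_star = 0.0, []
        for b in range(k):
            yk = y1[idx == b]
            sizes = [set_measure(yk, alpha, h, grid) for h in H]
            j = int(np.argmin(sizes))                # step 2(c)
            h_star.append(H[j])
            q += len(yk) * sizes[j] / len(x1)        # step 2(d)
        if best is None or q < best[0]:
            best = (q, w, h_star)
    return best[1], best[2]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.5, 1.5, 1000)
    y = x**2 + rng.normal(0.0, 0.25 + np.abs(x))     # hypothetical toy data
    half = len(x) // 2                               # step 1: Z1 is the first half
    grid = np.linspace(-8.0, 12.0, 801)
    w_hat, h_hat = tune(x[:half], y[:half], 0.1,
                        [0.3, 0.6, 1.0], [0.2, 0.4, 0.8], -1.5, 1.5, grid)
    print(w_hat, h_hat)
```

Step 4 would then rebuild the local sets on the second half of the data using only `w_hat` and `h_hat`, so that the selected bandwidths are independent of the points used to form the final band.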



Following Remark 6, one can use different conformity measures to construct $\hat C$. In principle, the above sample splitting procedure works for any conformity measure.

It is straightforward to show that the band $\hat C$ constructed as above using data-driven tuning parameters is locally valid and marginally valid, because the bandwidth $(\hat w, \hat h)$ used is independent of the training data $Z_2$. From the construction of $\hat C$, it will have small excess risk if the conformal prediction set is stable under random sampling. Then asymptotic efficiency follows if one can relate the excess risk to the symmetric difference risk. A rigorous argument is beyond the scope of this paper and will be pursued in a separate paper.

6 Data Examples 

In this section we apply our method to some examples. 

6.1 A Synthetic Example 

The procedure is illustrated by the following example in which $d = 1$, and

$$ X \sim \mathrm{Unif}[-1.5, 1.5], \qquad (Y \mid X = x) \sim 0.5\, N\big(f(x) - g(x), \sigma^2(x)\big) + 0.5\, N\big(f(x) + g(x), \sigma^2(x)\big), \tag{24} $$

where

$$ f(x) = (x-1)^2 (x+1), \qquad g(x) = 2\sqrt{x + 0.5}\; \mathbb{1}(x \ge -0.5), \qquad \sigma^2(x) = 1/4 + |x|. $$

For $x \le -0.5$, $(Y \mid X = x)$ is a Gaussian centered at $f(x)$ with varying variance $\sigma^2(x)$. For $x > -0.5$, $(Y \mid X = x)$ is a two-component Gaussian mixture, and for large values of $x$, the two components have little overlap.
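For readers who want to reproduce the setting, the model above can be simulated with a few lines of NumPy. This is a minimal sketch; the function names and the seed handling are illustrative choices, not part of the paper.

```python
import numpy as np

def f(x):
    return (x - 1.0) ** 2 * (x + 1.0)

def g(x):
    # 2 * sqrt(x + 0.5) for x >= -0.5 and 0 otherwise
    return 2.0 * np.sqrt(np.clip(x + 0.5, 0.0, None))

def sigma2(x):
    return 0.25 + np.abs(x)

def sample_synthetic(n, seed=0):
    """Draw n pairs (X, Y) from the two-component mixture model."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.5, 1.5, size=n)
    sign = rng.choice([-1.0, 1.0], size=n)        # mixture component, weight 0.5 each
    y = rng.normal(f(x) + sign * g(x), np.sqrt(sigma2(x)))
    return x, y

if __name__ == "__main__":
    x, y = sample_synthetic(1000)
    print(x[:3], y[:3])
```

Because $g(x) = 0$ for $x \le -0.5$, both mixture components coincide there and the draw reduces to a single Gaussian centered at $f(x)$.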

The performance of prediction bands using local conformity is plotted and compared with the marginally valid band in Figure 3, with $n = 1000$, $\alpha = 0.1$. The conformity measure used here is $\hat p^{(x,y)}(Y_i \mid X_i)$. The locally valid prediction band is constructed by partitioning the support of $P_X$ into 10 equal-sized bins, whereas the marginally valid band is constructed by a global ranking with the same conformity measure. We see that although the locally valid band has larger Lebesgue measure, it gives the desired coverage for all values of $x$. The marginally valid band over-covers for smaller values of $x$ and under-covers for larger values of $x$. We also plot the effect of bandwidth on the size of the prediction set (lower left panel of Figure 3).
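The binned construction described above can be imitated with a simplified sketch. It is not the exact conformal procedure: within each of the 10 bins it thresholds a Gaussian kernel density estimate of $Y$ at the order statistic of rank $\lfloor n_k \alpha \rfloor$, lowered by $\psi_K/(n_k h)$, which is an assumed stand-in for the conformal set based on the sandwiching sets used in the proof of Theorem 9. The bandwidth `h = 0.4`, the function names and the seeds are illustrative assumptions, and every bin is assumed to be non-empty.

```python
import numpy as np

def sample_synthetic(n, seed):
    """Draw (X, Y) from the mixture model of Section 6.1."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.5, 1.5, size=n)
    f = (x - 1.0) ** 2 * (x + 1.0)
    g = 2.0 * np.sqrt(np.clip(x + 0.5, 0.0, None))
    sign = rng.choice([-1.0, 1.0], size=n)
    return x, rng.normal(f + sign * g, np.sqrt(0.25 + np.abs(x)))

def kde(y_eval, y_train, h):
    """Gaussian kernel density estimate of y_train, evaluated at y_eval."""
    u = (np.asarray(y_eval)[:, None] - np.asarray(y_train)[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(y_train) * h * np.sqrt(2 * np.pi))

def fit_local_band(x, y, alpha, edges, h):
    """Per bin, store the Y sample and a density threshold defining the local set."""
    bins = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
    psi_K = 1.0 / np.sqrt(2 * np.pi)          # sup K - inf K for the Gaussian kernel
    fitted = []
    for k in range(len(edges) - 1):
        yk = y[bins == k]
        dens = np.sort(kde(yk, yk, h))
        rank = max(int(np.floor(len(yk) * alpha)), 1)
        fitted.append((yk, dens[rank - 1] - psi_K / (len(yk) * h)))
    return fitted

def covers(fitted, edges, h, x_new, y_new):
    """Boolean array: is y_new inside the local set of the bin containing x_new?"""
    bins = np.clip(np.digitize(x_new, edges) - 1, 0, len(edges) - 2)
    out = np.zeros(len(x_new), dtype=bool)
    for k, (yk, thr) in enumerate(fitted):
        m = bins == k
        if m.any():
            out[m] = kde(y_new[m], yk, h) >= thr
    return out

if __name__ == "__main__":
    edges = np.linspace(-1.5, 1.5, 11)        # 10 equal-sized bins
    x, y = sample_synthetic(1000, seed=1)
    x_new, y_new = sample_synthetic(2000, seed=2)
    fitted = fit_local_band(x, y, alpha=0.1, edges=edges, h=0.4)
    inside = covers(fitted, edges, 0.4, x_new, y_new)
    print("overall empirical coverage:", inside.mean())
```

Computing the coverage separately for each bin of `x_new` gives a rough analogue of the conditional coverage curve shown in Figure 3.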



6.2 Car Data 

Next we consider an example on car mileage. The original data contains features for about 400 cars. For each car, the data consist of miles per gallon, horse power, engine displacement, size, acceleration, number of cylinders, model year, and origin of manufacture. These data have been used in statistics textbooks (for example, DeGroot & Schervish (2012), Chapter 11) to illustrate the art of linear regression analysis. Here we reproduce the linear model built in Example 11.3.2 of DeGroot & Schervish (2012), where we want to predict the miles per gallon by the horse power. Clearly, the relationship between miles per gallon and horse power is far from linear (Figure 4), so some transformation must be applied prior to linear model fitting. It makes sense to assume, both from intuition and data plots, that the inverse of miles per gallon, namely, gallons per mile, has roughly a linear dependence on the horse power.

In the right panel of Figure 4 we plot the level 0.9 prediction band obtained from the linear regression model. The overall coverage is reasonably close to the nominal level. However, due to the non-uniform noise level, the band is too wide for small values of horse power and too narrow for large values. In the left panel, we plot the nonparametric conformal prediction band using conformity measure $\hat p(Y_i \mid X_i)$ to enhance smoothness of the estimated band. Such a band is asymptotically close to the one given in (20). The bandwidths are $h_x = 14$ and $h_y = 1.4$. The partition $\mathcal{A}$ is constructed by partitioning the range of horse power into several intervals so that each set $A_k$ contains roughly the same number of sample points. Here the tuning parameter is the number of partitions and is set to 8.

The advantage of our method is clear. First, it automatically outputs good prediction bands without requiring the choice of a variable transformation. The tuning parameters can be chosen either by an automated procedure as described in Algorithm 3, or by conventional choices (kernel bandwidth selectors). Second, the conformal prediction band is truly distribution free, with valid coverage for all distributions and all sample sizes.

Figure 3: Conditional and marginal prediction bands. The bottom left panel shows the relationship between bandwidth and Lebesgue measure of the prediction band. The bottom right panel shows the conditional coverage of the estimated set $\hat C(x)$ as a function of $x$.

Figure 4: Level 0.9 prediction bands using local conformal prediction (left) and linear regression with variable transformation (right).

7 Final Remarks 

We have constructed nonparametric prediction bands with finite sample, distribution 
free validity. With regularity assumptions, the band is efficient in the sense of achieving the 
minimax bound. The tuning parameters are completely data-driven. We believe this is the 
first prediction band with these properties. 

An important open question is to establish a rigorous result on the asymptotic efficiency for the data-driven bandwidth. A sketch of such an argument can be given by combining two facts. First, the empirical average excess loss $n^{-1} \sum_k n_k\, \mu(\hat C_{h,k})$ is a good approximation to the excess risk $E\big[\int \mu(\hat C_{h,k}(x))\, P_X(dx)\big]$ for all $w$ and $h$. This problem is technically similar to those considered by Rinaldo et al. (2010) in the study of stability of plug-in density level sets and prediction sets. Second, one can show that the excess risk provides an upper bound on the symmetric difference risk $E\,\mu(\hat C \triangle C_P)$, as given in Lei et al. (2011) (see also Scott & Nowak (2006)).

The bands are not suitable for high-dimensional regression problems. In current work, we 
are developing methods for constructing prediction bands that exploit sparsity assumptions. 
These will yield valid prediction and variable selection simultaneously. 
8 Appendix

In the appendix, we give supplementary technical details. 

8.1 Technical Definitions 

Now we give formal definitions of some technical terms used in the asymptotic analysis, including the Hölder class of functions and valid kernel functions of order $\beta$. These definitions can be found in standard nonparametric inference textbooks such as Tsybakov (2009, Section 1.2).

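Definition 13 (Hölder Class). Given $L > 0$, $\beta > 0$. Let $\ell = \lfloor \beta \rfloor$. The Hölder class $\Sigma(\beta, L)$ is the family of functions $f : \mathbb{R}^1 \to \mathbb{R}^1$ whose derivative $f^{(\ell)}$ satisfies

$$ |f^{(\ell)}(x) - f^{(\ell)}(x')| \le L |x - x'|^{\beta - \ell}, \quad \forall\, x, x'. $$

Remark: If $f \in \Sigma(\beta, L)$, then $f$ can be uniformly approximated by its local polynomials of order $\ell$. Define

$$ f_{x_0}^{\ell}(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(\ell)}(x_0)}{\ell!} (x - x_0)^{\ell}. $$

Then

$$ |f(x) - f_{x_0}^{\ell}(x)| \le \frac{L}{\ell!} |x - x_0|^{\beta}. $$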

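Definition 14 (Valid Kernels of Order $\beta$). Let $\beta > 0$ and $\ell = \lfloor \beta \rfloor$. Say that $K : \mathbb{R}^1 \to \mathbb{R}^1$ is a valid kernel of order $\beta$ if the functions $u \mapsto u^j K(u)$, $j = 1, \ldots, \ell$, are integrable and satisfy

$$ \int K(u)\,du = 1, \qquad \int u^j K(u)\,du = 0, \quad j = 1, \ldots, \ell. $$

Remark: The relationship between a Hölder class $\Sigma(\beta, L)$ and a valid kernel $K$ of order $\beta$ is that for any $p \in \Sigma(\beta, L)$ and $h = o(1)$, $\|p - p * K_h\|_\infty \le C h^{\beta}$ for a constant $C$ depending only on $L$, $\beta$ and $K$, where $*$ is the convolution operator and $K_h(x) = h^{-1} K(x/h)$.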
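The moment conditions in Definition 14 are easy to check numerically. The sketch below does so for two compactly supported kernels chosen here purely for illustration (they are not taken from the paper): the Epanechnikov kernel and a polynomial kernel $K(u) = \frac{45}{32}(1 - \frac{7}{3}u^2)(1 - u^2)$ on $[-1, 1]$, whose first three moments vanish. The helper names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def higher_order(u):
    # Polynomial kernel on [-1, 1]; it takes negative values near the boundary.
    return (45.0 / 32.0) * (1.0 - (7.0 / 3.0) * u**2) * (1.0 - u**2) * (np.abs(u) <= 1.0)

def moment(K, j, m=200001):
    """Trapezoid-rule approximation of the integral of u^j K(u) over [-1, 1]."""
    u = np.linspace(-1.0, 1.0, m)
    vals = u**j * K(u)
    return (u[1] - u[0]) * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def max_vanishing_order(K, j_max=6, tol=1e-6):
    """Largest l such that the moments of order 1..l all vanish numerically."""
    order = 0
    for j in range(1, j_max + 1):
        if abs(moment(K, j)) > tol:
            break
        order = j
    return order

if __name__ == "__main__":
    for name, K in (("epanechnikov", epanechnikov), ("higher_order", higher_order)):
        print(name, moment(K, 0), max_vanishing_order(K))
```

A kernel whose moments vanish up to order $\ell$ satisfies Definition 14 for every $\beta$ with $\lfloor \beta \rfloor \le \ell$, which is why smoother densities call for kernels that are allowed to be negative somewhere.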

8.2 Proof of Lemma 1

Proof of Lemma 1. For simplicity we prove the case where $d = 1$. Let

$$ TV(P, Q) = \sup_A |P(A) - Q(A)| $$

denote the total variation distance between $P$ and $Q$. Given any $\epsilon > 0$ define

$$ \epsilon_n = 2\big[1 - (1 - \epsilon^2/8)^{1/n}\big]. $$

From Lemma A.1 of Donoho (1988), if $TV(P, Q) \le \epsilon_n$ then $TV(P^n, Q^n) \le \epsilon$.

Fix $\epsilon > 0$. Let $x_0$ be a non-atom and choose $\delta$ such that $0 < P_X[B(x_0, \delta)] < \epsilon_n$, with $\epsilon_n$ as defined above.

Fix $B > 0$ and let $B_0 = B/(2(1 - \alpha))$. Define another distribution $Q$ by

$$ Q(A) = P(A \cap S^c) + U(A \cap S), $$

where

$$ S = \{(x, y) : x \in B(x_0, \delta),\ y \in \mathbb{R}\} $$

and $U$ has total mass $P(S)$ and is uniform on $\{(x, y) : x \in B(x_0, \delta),\ |y| < B_0\}$. Note that $P(S) > 0$, $Q(S) > 0$ and $TV(P, Q) \le \epsilon_n$. It follows that $TV(P^n, Q^n) \le \epsilon$.

Note that, for all $x \in B(x_0, \delta)$, $\int_{\hat C(x)} q(y \mid x)\,dy \ge 1 - \alpha$ implies that $\mu[\hat C(x)] \ge 2(1 - \alpha) B_0 = B$. Hence,

$$ Q^n\Big( \operatorname*{ess\,sup}_{x \in B(x_0, \delta)} \mu[\hat C(x)] \ge B \Big) = 1. $$

Thus,

$$ P^n\Big( \operatorname*{ess\,sup}_{x \in B(x_0, \delta)} \mu[\hat C(x)] \ge B \Big) \ge Q^n\Big( \operatorname*{ess\,sup}_{x \in B(x_0, \delta)} \mu[\hat C(x)] \ge B \Big) - \epsilon = 1 - \epsilon. $$

Since $\epsilon$ and $B$ are arbitrary, the result follows. □

8.3 Proofs of asymptotic efficiency 

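Lemma 15. Given $\lambda > 0$, under conditions A2 and A4, there exists a numerical constant $\xi_1$ such that

$$ P\Big( \sup_k \|\hat p(\cdot \mid A_k) - p(\cdot \mid A_k)\|_\infty \ge \xi_1 r_n \Big) = O(n^{-\lambda}). $$

Proof. For any fixed $k$, $Y_{i_1}, \ldots, Y_{i_{n_k}}$ is a random sample from $P(y \mid A_k)$ conditioning on $n_k$. Let $\bar p(y \mid A_k)$ be the convolution density $p(\cdot \mid A_k) * K_{h_n}(\cdot)$. Then, using a result from Giné & Guillou (2002), there exist numerical constants $C_1$, $C_2$ and $\xi_0$ such that for all $\xi \ge \xi_0$,

$$ P\Big( \|\hat p(\cdot \mid A_k) - \bar p(\cdot \mid A_k)\|_\infty \ge \xi \sqrt{\log n_k/(n_k h_n)} \Big) \le C_1 h_n^{C_2 \xi^2}. \tag{25} $$

On the other hand, by the Hölder condition on $p(y \mid x)$ and hence on $p(\cdot \mid A_k)$, we have

$$ \|p(\cdot \mid A_k) - \bar p(\cdot \mid A_k)\|_\infty \le L h_n^{\beta}. $$

Putting these together with a union bound over all $A_k \in \mathcal{A}_n$,

$$ P\Big( \exists k : \|\hat p(\cdot \mid A_k) - p(\cdot \mid A_k)\|_\infty \ge \xi \sqrt{\log n_k/(n_k h_n)} + L h_n^{\beta} \Big) \le C_1 h_n^{C_2 \xi^2} w_n^{-d}. $$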

Consider the event

$$ E = \{ b_1 n w_n^d/2 \le n_k \le 3 b_2 n w_n^d/2 \ \text{for all } k \}, $$

where the constants $b_1$ and $b_2$ are defined as in Assumption A1(a). By Lemma 20 we have

$$ P(E^c) \le C_3 w_n^{-d} \exp(-C_4 n w_n^d), $$

with constants $C_3$, $C_4$ defined in Lemma 20.
On $E$ and for $n$ large enough we have



logn fc <2 I 2/3 4 1 
n k h n ~ 



logn 



d [/3(d + 2) + 1] V nw% 



Note that under Assumption A4, $\sqrt{\log n/(n w_n^d h_n)} = h_n^{\beta} = r_n$.
Let 



6 = 2 1 



2/3 + 1 



ci [/3(d + 2) + 1] 



'A(/3(d + 2) + l)+/3d 



V& +A 



where the constant c\ is defined in Assumption Al(a), C2 defined in equation (25), and L 
defined in A2(a). 

Then we have 

P (sup \\p{-\A k ) -p(-L4 fe )|U < £\r, 

>P(sup||p(-|A fc )-p(-|A fc )|| 00 <(6-L)W^|^ + ^, E ) 

>P I 8U P ||p(.|A fe ) -p(.|A»)|U < 7== \f^ + Lh n, E 



ci[/3(d+2)+l] 



>1 - P ( 3* : ||p(-|^) -pOWIU > ft L a/ 1 ^ + Mf 



2/3+1 V ^fc^n 



ci[/3(d+2)+l] 



PM 



=1 - 0(n 



D 

Corollary 16. Let $R_n(x) = \|\hat p_n(y \mid A_k) - p(y \mid x)\|_\infty$. Then for any $\lambda > 0$, there exists $\xi_{1,\lambda} > 0$ such that

$$ P\Big( \sup_x R_n(x) \ge \xi_{1,\lambda} r_n \Big) = O(n^{-\lambda}). $$

Proof. First, by the Lipschitz condition A2(c) on $p(y \mid x)$,

$$ \|p(y \mid A_k) - p(y \mid x)\|_\infty \le \sqrt{d} L w_n. $$

Note that $w_n = r_n$; the claim then follows by applying Lemma 15 with an appropriate choice of $\xi_{1,\lambda}$.
Lemma 17. Let

$$ V_n(x) = \sup_{t \ge t_0} \big| \hat P(L^{\ell}_x(t) \mid A_k) - P(L^{\ell}_x(t) \mid x) \big|. $$

Then, for any $\lambda > 0$, there exists $\xi_{2,\lambda}$ such that

$$ P\Big( \sup_x V_n(x) \ge \xi_{2,\lambda} r_n^{\gamma_1} \Big) = O(n^{-\lambda}), $$

with $\gamma_1 = \min(\gamma, 1)$.

Proof. Consider a fixed $A_k$ and an $x \in A_k$. Note that $\{L^{\ell}_x(t) : t \ge t_0\}$ is a nested class of sets with VC dimension 2. By classical empirical process theory, for all $B > 0$ we have

$$ P\Big( \sup_{t \ge t_0} \big| \hat P(L^{\ell}_x(t) \mid A_k) - P(L^{\ell}_x(t) \mid A_k) \big| \ge B \sqrt{\log n_k / n_k} \Big) \le C_0\, n_k^{-(B^2/32 - 2)} \tag{26} $$

for some universal constant $C_0$.
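On the other hand,

$$ \big| P(L^{\ell}_x(t) \mid A_k) - P(L^{\ell}_x(t) \mid x) \big| = \Big| \int_{L^{\ell}_x(t)} \big( p(y \mid A_k) - p(y \mid x) \big)\,dy \Big| \le \sqrt{d} L w_n\, \mu(L_x(t)) \le \sqrt{d} L w_n\, \mu(L_x(t_0)) \le C L \sqrt{d}\, w_n, \tag{27} $$

where the constant $C$ is defined in Assumption A3(b). (The integral over $L^{\ell}_x(t)$ equals minus the integral over the upper level set $L_x(t)$, because both densities integrate to one.)

Note that on $E$ we have $\sqrt{\log n_k / n_k} = o(r_n)$ and hence $\sqrt{\log n_k / n_k} \le r_n$ for $n$ large enough.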

Consider any x' £ A k . 

P(LUt)\A k ) - P(LUt)\x') 

P(L«At)\A k ) - P(Li(t)\A k ) + PU',U)\A,) P{L',U)\.r) 



< 



<M'\Ak)\UlM {L e x (t)ALi,(t)) + V n (x) + \G x {t) ~ G x ,(t)\ 



<\\p{-\Ak) 



c 2 (2L)' 



-w: 



V n (x) + CVdLw n + 



c 2 TUi+ x 



-w' 



\P(Li(t)\x) - P(Li,(t)\x')\ 

(28) 



where the last step uses Lemma 18 to control $\mu\big(L^{\ell}_x(t) \triangle L^{\ell}_{x'}(t)\big)$ and $G_x(t) - G_{x'}(t)$.



Lemma 15 implies that, except for a probability of $O(n^{-\lambda})$, $\sup_k \|\hat p(\cdot \mid A_k)\|_\infty = L + o(1)$, with $L$ defined in A2(b). Combining (26), (27), and (28), we have, for some constant $\xi_{2,\lambda}$,

$$ P\Big( \sup_x V_n(x) \ge \xi_{2,\lambda} r_n^{\gamma_1} \Big) = O(n^{-\lambda}), $$

where $\gamma_1 = \min(\gamma, 1)$. □
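Lemma 18. Under Assumptions A1-A3,

$$ \sup_k \ \sup_{t \ge t_0,\ x, x' \in A_k} |G_x(t) - G_{x'}(t)| = O(w_n^{\gamma_1}). $$

Proof.

$$ \begin{aligned} L_x(t) \triangle L_{x'}(t) &= \{y : p(y \mid x) \ge t,\ p(y \mid x') < t\} \cup \{y : p(y \mid x) < t,\ p(y \mid x') \ge t\} \\ &= \{y : t \le p(y \mid x) < t + L w_n,\ p(y \mid x') < t\} \cup \{y : t - L w_n \le p(y \mid x) < t,\ p(y \mid x') \ge t\} \\ &\subseteq \{y : t - L w_n \le p(y \mid x) \le t + L w_n\}, \end{aligned} \tag{29} $$

where the first step uses the fact that $\|p(\cdot \mid x) - p(\cdot \mid x')\|_\infty \le L \|x - x'\|$ and the constant $L$ is from Assumption A2(c).

$$ \begin{aligned} |G_x(t) - G_{x'}(t)| &\le \big| P(L^{\ell}_x(t) \mid x) - P(L^{\ell}_x(t) \mid x') \big| + \big| P(L^{\ell}_x(t) \mid x') - P(L^{\ell}_{x'}(t) \mid x') \big| \\ &= \big| P(L_x(t) \mid x) - P(L_x(t) \mid x') \big| + \big| P(L^{\ell}_x(t) \mid x') - P(L^{\ell}_{x'}(t) \mid x') \big| \\ &\le \mu(L_x(t))\, \|p(\cdot \mid x) - p(\cdot \mid x')\|_\infty + \|p(\cdot \mid x')\|_\infty\, \mu\big(L^{\ell}_x(t) \triangle L^{\ell}_{x'}(t)\big) \\ &\le C \sqrt{d} L w_n + L \big[ G_x(t + L w_n) - G_x(t - L w_n) \big] \\ &\le C \sqrt{d} L w_n + c_2 2^{\gamma} L^{\gamma + 1} w_n^{\gamma}, \end{aligned} \tag{30} $$

where the constant $L$ is from Assumption A2 and $(c_2, C, \gamma)$ are defined in Assumption A3. □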



We complete the argument using Cadre et al. (2009) and Lei et al. (2011). 



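Lemma 19. Fix $\alpha > 0$ and $t_0 > 0$. Suppose $p$ is a density function satisfying Assumption A3(a). Let $\hat p$ be an estimated density such that $\|\hat p - p\|_\infty \le \nu_1$, and $\hat P$ be a probability measure satisfying $\sup_{t \ge t_0} |\hat P(L^{\ell}(t)) - P(L^{\ell}(t))| \le \nu_2$. Define $\hat t^{(\alpha)} = \inf\{t \ge 0 : \hat P(\hat L^{\ell}(t)) \ge \alpha\}$. If $\nu_1$, $\nu_2$ are small enough such that $\nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma} \le t^{(\alpha)} - t_0$ and $c_1^{-1/\gamma} \nu_2^{1/\gamma} \le \epsilon_0$ (where $c_1$, $\gamma$ are constants in Assumption A3(a)), then

$$ |\hat t^{(\alpha)} - t^{(\alpha)}| \le \nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma}. \tag{31} $$

Moreover, for any $\tilde t^{(\alpha)}$ such that $|\tilde t^{(\alpha)} - \hat t^{(\alpha)}| \le \nu_3$, if $2\nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma} + \nu_3 \le \epsilon_0$, then there exist constants $\xi_1$, $\xi_2$ and $\xi_3$ such that

$$ \mu\big( \hat L(\tilde t^{(\alpha)}) \triangle L(t^{(\alpha)}) \big) \le \xi_1 \nu_1^{\gamma} + \xi_2 \nu_2 + \xi_3 \nu_3^{\gamma}. $$

Proof. The proof follows essentially from Lei et al. (2011), which is a modified version of the argument used in Cadre et al. (2009).

For $t \ge t_0$, let $\hat L^{\ell}(t) = \{y : \hat p(y) \le t\}$. By the assumptions in the lemma we have

$$ \begin{aligned} & L^{\ell}(t - \nu_1) \subseteq \hat L^{\ell}(t) \subseteq L^{\ell}(t + \nu_1) \\ \Rightarrow\ & P(L^{\ell}(t - \nu_1)) \le P(\hat L^{\ell}(t)) \le P(L^{\ell}(t + \nu_1)) \\ \Rightarrow\ & P(L^{\ell}(t - \nu_1)) - \nu_2 \le \hat P(\hat L^{\ell}(t)) \le P(L^{\ell}(t + \nu_1)) + \nu_2. \end{aligned} $$

Hence,

$$ \hat P\big( \hat L^{\ell}(t^{(\alpha)} - \nu_1 - c_1^{-1/\gamma} \nu_2^{1/\gamma}) \big) \le P\big( L^{\ell}(t^{(\alpha)} - c_1^{-1/\gamma} \nu_2^{1/\gamma}) \big) + \nu_2 \le \alpha, $$

where the last step uses Assumption A3(a).

Therefore, we must have $\hat t^{(\alpha)} \ge t^{(\alpha)} - \nu_1 - c_1^{-1/\gamma} \nu_2^{1/\gamma}$. A similar argument gives $\hat t^{(\alpha)} \le t^{(\alpha)} + \nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma}$. This proves the first part.

For the second part, note that

$$ \hat L(\tilde t^{(\alpha)}) \triangle L(t^{(\alpha)}) = \{y : \hat p(y) \ge \tilde t^{(\alpha)},\ p(y) < t^{(\alpha)}\} \cup \{y : \hat p(y) < \tilde t^{(\alpha)},\ p(y) \ge t^{(\alpha)}\}. $$

By the assumption on $\tilde t^{(\alpha)}$ and the first result,

$$ \{\hat p(y) \ge \tilde t^{(\alpha)}\} \subseteq \{p(y) \ge t^{(\alpha)} - 2\nu_1 - c_1^{-1/\gamma} \nu_2^{1/\gamma} - \nu_3\}, $$

$$ \{\hat p(y) < \tilde t^{(\alpha)}\} \subseteq \{p(y) < t^{(\alpha)} + 2\nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma} + \nu_3\}. $$

As a result,

$$ \begin{aligned} \mu\big( \hat L(\tilde t^{(\alpha)}) \triangle L(t^{(\alpha)}) \big) &\le \mu\big( \{y : |p(y) - t^{(\alpha)}| \le 2\nu_1 + c_1^{-1/\gamma} \nu_2^{1/\gamma} + \nu_3\} \big) \\ &\le t_0^{-1} c_2 \big( 4\nu_1 + 2 c_1^{-1/\gamma} \nu_2^{1/\gamma} + 2\nu_3 \big)^{\gamma} \le \xi_1 \nu_1^{\gamma} + \xi_2 \nu_2 + \xi_3 \nu_3^{\gamma}, \end{aligned} $$

where $(\xi_1, \xi_2, \xi_3)$ are functions of $(t_0, c_1, c_2, \gamma)$. □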

Proof of Theorem 9. The proof is based on a direct application of Lemma 19 to the density $p(\cdot \mid x)$, the empirical measure $\hat P(\cdot \mid A_k)$ and the estimated density function $\hat p(\cdot \mid A_k)$.

Here we use $\hat L$ for upper level sets of $\hat p(\cdot \mid A_k)$ and omit the dependence on $k$. Conditioning on $(i_1, \ldots, i_{n_k})$, one can show that the local conformal prediction set $\hat C^{(\alpha)}(x)$ is "sandwiched" by two estimated level sets:

$$ \hat L\big( \hat p(X_{(i_\alpha)}, Y_{(i_\alpha)} \mid A_k) \big) \subseteq \hat C^{(\alpha)}(x) \subseteq \hat L\big( \hat p(X_{(i_\alpha)}, Y_{(i_\alpha)} \mid A_k) - (n_k h_n)^{-1} \psi_K \big), $$

where $\psi_K = \sup_{x, x'} |K(x) - K(x')|$. So the asymptotic properties of $\hat C^{(\alpha)}(x)$ can be obtained from those of the sandwiching sets.

Recall that $(X_{(i_\alpha)}, Y_{(i_\alpha)})$ is the element of $\{(X_{i_1}, Y_{i_1}), \ldots, (X_{i_{n_k}}, Y_{i_{n_k}})\}$ such that $\hat p(Y_{(i_\alpha)} \mid A_k)$ ranks $\lfloor n_k \alpha \rfloor$ in ascending order among all $\hat p(Y_{i_j} \mid A_k)$, $1 \le j \le n_k$. Let $\hat t^{(\alpha)} = \hat p(X_{(i_\alpha)}, Y_{(i_\alpha)})$. It is easy to check that

$$ \hat t^{(\alpha)} = \inf\big\{ t \ge 0 : \hat P(\hat L^{\ell}(t) \mid A_k) \ge \alpha \big\}. $$
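Consider the event

$$ E_\lambda = \Big\{ \sup_x R_n(x) \le \xi_{1,\lambda} r_n,\ \ \sup_x V_n(x) \le \xi_{2,\lambda} r_n^{\gamma_1} \Big\}, $$

where $\xi_{1,\lambda}$ and $\xi_{2,\lambda}$ are defined as in the statements of Corollary 16 and Lemma 17. We have $P(E_\lambda^c) = O(n^{-\lambda})$.

Let $\nu_1 = \xi_{1,\lambda} r_n$, $\nu_2 = \xi_{2,\lambda} r_n^{\gamma_1}$. Note that $r_n \to 0$ as $n \to \infty$, so for $n$ large enough, $\nu_1$ and $\nu_2$ satisfy the requirements in Lemma 19. Let $\nu_3 = 0$ in this case. Then we have, for some constants $\xi'_{1,\lambda}$, $\xi'_{2,\lambda}$, that

$$ P\Big( \sup_x \mu\big( \hat L(\hat t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big) \ge \xi'_{1,\lambda} r_n^{\gamma} + \xi'_{2,\lambda} r_n^{\gamma_1} \Big) = O(n^{-\lambda}), $$

which is equivalent to

$$ P\Big( \sup_x \mu\big( \hat L(\hat t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big) \ge \xi'_\lambda r_n^{\gamma_1} \Big) = O(n^{-\lambda}), $$

for some constant $\xi'_\lambda$ independent of $n$.

Now let $\tilde t^{(\alpha)} = \hat t^{(\alpha)} - (n_k h_n)^{-1} \psi_K$. Applying Lemma 19 with $\nu_3 = \nu_{3,n} = (n_k h_n)^{-1} \psi_K$, we obtain, for some constants $\xi''_{j,\lambda}$, $j = 1, 2, 3$,

$$ P\Big( \sup_x \mu\big( \hat L(\tilde t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big) \ge \xi''_{1,\lambda} r_n^{\gamma} + \xi''_{2,\lambda} r_n^{\gamma_1} + \xi''_{3,\lambda} \nu_{3,n}^{\gamma} \Big) = O(n^{-\lambda}). $$

Note that on $E$, $\nu_{3,n} = o(r_n)$, so the above inequality reduces to

$$ P\Big( \sup_x \mu\big( \hat L(\tilde t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big) \ge \xi''_\lambda r_n^{\gamma_1} \Big) = O(n^{-\lambda}). $$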

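The conclusion of Theorem 9 follows from the sandwiching property:

$$ \mu\big( \hat C^{(\alpha)}(x) \triangle L_x(t^{(\alpha)}) \big) \le \mu\big( \hat L(\hat t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big) + \mu\big( \hat L(\tilde t^{(\alpha)}) \triangle L_x(t^{(\alpha)}) \big), $$

where $\hat t^{(\alpha)} = \hat p(X_{(i_\alpha)}, Y_{(i_\alpha)})$ and $\tilde t^{(\alpha)} = \hat t^{(\alpha)} - (n_k h_n)^{-1} \psi_K$. □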

Lemma 20 (Lower bound on local sample size). Under Assumption A1:

$$ P\big( \forall k : b_1 n w_n^d/2 \le n_k \le 3 b_2 n w_n^d/2 \big) \ge 1 - C_3 w_n^{-d} e^{-C_4 n w_n^d}, $$

where $C_3 = 2\,[\mathrm{Diam}(\mathrm{supp}(P_X))]^d$ and $C_4 = b_1^2/(8 b_2 + 4 b_1/3)$, with $b_1$, $b_2$ defined in Assumption A1(a).

Proof. Let $p_k = P_X(A_k)$. Using Bernstein's inequality, for each $k$,

$$ P\big( |n_k - n p_k| \ge t \big) \le 2 \exp\left( - \frac{t^2/2}{n p_k (1 - p_k) + t/3} \right). $$

The result follows by taking $t = b_1 n w_n^d/2$ and a union bound. □
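The content of Lemma 20 can be illustrated with a small simulation for $d = 1$. The sketch below assumes $X$ is uniform on $[0, 1]$, so that $P_X(A_k) = w$ for every bin and one may take $b_1 = b_2 = 1$; this choice of constants, like the function names, is an assumption made here for illustration only.

```python
import numpy as np

def bin_counts(n, w, seed=0):
    """Counts n_k of n uniform draws on [0, 1] falling in bins of width w."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    k = int(round(1.0 / w))
    idx = np.minimum((x * k).astype(int), k - 1)
    return np.bincount(idx, minlength=k)

def within_bounds(counts, n, w, b1=1.0, b2=1.0):
    """Check b1*n*w/2 <= n_k <= 3*b2*n*w/2 for every bin (d = 1)."""
    return bool(np.all((counts >= b1 * n * w / 2) & (counts <= 3 * b2 * n * w / 2)))

def failure_bound(n, w, b1=1.0, b2=1.0, diam=1.0):
    """C3 * w^{-1} * exp(-C4 * n * w) with C3, C4 as in Lemma 20 for d = 1."""
    c3 = 2.0 * diam
    c4 = b1**2 / (8.0 * b2 + 4.0 * b1 / 3.0)
    return c3 / w * np.exp(-c4 * n * w)

if __name__ == "__main__":
    n, w = 10000, 0.1
    counts = bin_counts(n, w)
    print(counts, within_bounds(counts, n, w), failure_bound(n, w))
```

For moderately large $n w$ the bound on the failure probability is already negligible, which is why the event $E$ can be treated as given in the proofs above.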
8.4 Proof of Theorem 12



In the following proof we focus on the rate and ignore the tuning on constants. The proof 
uses Generalized Fano's Lemma and the construction follows these key steps. 

1. Let the marginal of $X$ be uniform on $[0,1]^d$. Divide $[0,1]^d$ into cubes of size $w > 0$.

2. Choose a density function $p_0(y)$ such that:

(a) $p_0(y)$ is symmetric and Hölder smooth of order $\beta$.

(b) There exist $y_0 < 0$ and $\delta > 0$ such that $p_0'(y) = 1$ for all $y \in (y_0 - \delta, y_0 + \delta)$.

3. For $x \in A_j$, let $x_j$ be the center of $A_j$. Define the conditional density:

$$ p(y \mid x) = p(y, x - x_j) = p_0(y) + h(x - x_j)\, K\!\left( \frac{y - y_0}{h^{1/\beta}(x - x_j)} \right) - h(x - x_j)\, K\!\left( \frac{y + y_0}{h^{1/\beta}(x - x_j)} \right), $$

where $h(x)$ is a function defined on $\mathbb{R}^d$ with support on $[-w/2, w/2]^d$, attaining its maximum at 0, and satisfying $\|h'\|_\infty \le M < \infty$, $h'(x) = 0$ for $\|x\|_\infty \ge w/2$. In particular, take $h(x) = w\, \eta(2x/w)$, where $\eta(x)$ is a $d$-dimensional kernel function supported on $[-1, 1]^d$. It is easy to verify that the following conditions hold:

(a) $p(\cdot \mid x)$ is a density function for all $x$.

(b) $p(y \mid x)$ is Hölder smooth of order $\beta$. This can be verified by noting that both $p_0$ and $h(x - x_j)\, K\!\left( \frac{y - y_0}{h^{1/\beta}(x - x_j)} \right)$ are Hölder smooth of order $\beta$.

(c) $|p(y \mid x) - p(y \mid x')| \le L \|x - x'\|$. This can be verified by noting that



d 



p(y\x) 



dx 
ti(x)K 



y-yo 



<\\h'\ 



h l /P{x - Xj) 
H-H^llooll^ 

\\K\\ 4- Wh'W \\K' 



- h{x)K' 



y-yo 



— " 1 1 oo 1 1 •**■ 1 1 oo 



h x IP{x-Xj)) (3 

iy-yo)h~Ti(x-Xj) 



h f 1 (x-x j )(y-y ) 



4. For $j = 1, \ldots, w^{-d}$, let $P_j$ be the distribution of $(X, Y)$ such that

(a) $P_X$ is uniform.

(b) $P_j(y \mid X = x) = p_0(y)$ for $x \notin A_j$.

(c) $P_j(y \mid X = x) = p(y \mid x)$ for $x \in A_j$.

We can verify that the Lipschitz condition $|p(y \mid x) - p(y \mid x')| \le L \|x - x'\|$ still holds if we require $h' = 0$ on the border of the histogram cube.

5. (Pairwise separation) For $i \ne j$, the conditional density level sets at $p_0(y_0)$ differ by at least $ch$ for some constant $c$. (Consider $p_j(y \mid X = x_j)$ and $p_i(y \mid X = x_j)$ and note that they correspond to the same level $\alpha$ as prediction bands.)

6. (K-L divergence) Let $h = h(0)$. Condition (b) in step 2 implies that there exists a constant $c > 0$ such that $\inf_{y : |y - y_0| \le h^{1/\beta}} p_0(y) \ge c$ for $h$ small enough. For any $i \ne j$,

hg piMx-) My)dy 

log 1 + f-r Po(y)dy 

log l ^ po(y)rfy 

< _ /i*** 1 " f hK{{y-y )/hW) _ h 2 K 2 ((y - y )/h^) \ 

Jyo-hW \ Po(y) p 2 o(y) J 

<-/i 2+ ? / K 2 (u)du = Ch 2+ k 



( K 2 (u)du = Ch 2 



As a result,

$$ KL(P_i \,\|\, P_j) \le C h^{2 + 1/\beta} w^d. $$

7. Using the generalized Fano's lemma (see also Tsybakov (2009, Chapter 2)):

$$ \inf_{\hat C} \sup_P E_P \sup_x \mu\big( \hat C(x) \triangle C(x) \big) \ge \frac{ch}{2} \left( 1 - \frac{C n h^{2 + 1/\beta} w^d + \log 2}{-d \log w} \right), \tag{32} $$

where the supremum is over all $P$ such that $p(y \mid x)$ is Lipschitz in $x$ in the sup-norm sense, and $p(y \mid x)$ is Hölder smooth of order $\beta$.

Choosing $h = w = c (\log n / n)^{1/(d + 2 + 1/\beta)}$ with constant $c$ small enough, we have

$$ \inf_{\hat C} \sup_P E_P \sup_x \mu\big( \hat C(x) \triangle C(x) \big) \ge c' \left( \frac{\log n}{n} \right)^{\frac{1}{d + 2 + 1/\beta}}. $$

Note that the choice $h \asymp w$ is required by the condition

$$ h K(0) = p(y_0 \mid x_j) - p(y_0 \mid x_j + w/2) \le L w. $$